Chapter 1

1.3 What is the grammar of graphics?

The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). A plot is built from the following components:
* Data
* Layers, made up of geometric elements and statistical transformations.
* Scales
* Coordinate system
* Facet: how to break up the data into subsets and display those subsets as small multiples.
* Theme: controls the finer points of display, such as font size and background colour.

Note: ggplot2 can only create static graphics.
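
As a rough illustration of how these components fit together, the sketch below spells each one out explicitly, including the pieces that ggplot2 would normally fill in with defaults. The choice of variables and the facet by drv are assumptions made only for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +      # data and aesthetic mapping
  layer(                                 # one layer: geom + stat + position
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +                 # scales (these are the defaults)
  scale_y_continuous() +
  coord_cartesian() +                    # coordinate system (the default)
  facet_wrap(~drv) +                     # facet: one panel per drive train
  theme_grey()                           # theme (the default)

# The identity stat leaves the data unchanged, so the layer has one row
# of computed data per observation in mpg
nrow(layer_data(p))
```

In everyday use most of these lines are omitted: geom_point() is shorthand for the layer() call, and the scales, coordinate system and theme shown here are what you get if you say nothing.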

Chapter 2 Getting started with ggplot2

2.1 Introduction

2.2 Fuel economy data

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#library(magrittr)
mpg
## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
##  1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
##  2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
##  3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
##  4         audi         a4   2.0  2008     4   auto(av)     f    21    30
##  5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
##  6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
##  7         audi         a4   3.1  2008     6   auto(av)     f    18    27
##  8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
##  9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
  • cty and hwy record miles per gallon (mpg) for city and highway driving.

  • displ is the engine displacement in litres.

  • drv is the drive train: front wheel (f), rear wheel (r) or four wheel (4).

  • model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.

  • class (not shown) is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.

2.3 Key components

  1. Data
  2. Aesthetic mappings between variables in the data and visual properties.
  3. Layer(s) that describe how to render each observation.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Data: mpg
Aesthetic mapping: engine size mapped to x position, fuel economy to y position.
Layer: points
Data and aesthetic mappings are supplied in ggplot(), then layers are added on with +.

2.4 Color, size, shape and other aesthetic attributes.

ggplot(mpg, aes(displ, cty, colour = class)) +
  geom_point()

This gives each point a unique colour corresponding to its class.

ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))

ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

In the first plot, the value “blue” is scaled to a pinkish colour, and a legend is added. In the second plot, the points are given the R colour blue.
When using aesthetics in a plot, less is usually more.
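
The difference between mapping and setting a colour can also be checked without looking at the plots. The sketch below uses layer_data() to inspect the colours ggplot2 computes for each layer; the object names mapped and fixed are made up for the example.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data and passed through the colour scale
mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))

# Outside aes(): "blue" is used directly as the colour of the points
fixed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

# The mapped version yields whatever the default colour scale assigns to
# its single level, which is not the R colour "blue"
unique(layer_data(mapped)$colour)

# The fixed version yields the literal colour name
unique(layer_data(fixed)$colour)
```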

2.5 Faceting

Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
There are two types of faceting: grid and wrapped. Wrapped is the most useful, so it is covered here; grid faceting comes later.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  facet_wrap(~class)

2.6 Plot geoms

2.6.1 Adding a smoother to a plot

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth() # se = TRUE by default; se = FALSE hides the confidence band
## `geom_smooth()` using method = 'loess'

This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of a point-wise confidence interval shown in grey.
  • method = "loess", the default for small n, uses a smooth local regression. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess'

  • method = "gam" fits a generalised additive model provided by the mgcv package. Load mgcv first so that the s() smoother in the formula is available.
library(mgcv)
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(method = "gam", formula = y ~ s(x))

  • method = "lm" fits a linear model, giving the line of best fit.
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • method = "rlm" works like lm(), but uses a robust fitting algorithm so that outliers don’t affect the fit as much.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(method = "rlm")

2.6.2 Boxplots and jittered points

ggplot(mpg, aes(drv, hwy)) + 
  geom_point()

Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter()

Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.

ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()

Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.

ggplot(mpg, aes(drv, hwy)) + 
  geom_violin()

2.6.3 Histograms and frequency polygons

ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data.

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 2.5)

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 1)

I’m not a fan of density plots because they are harder to interpret since the underlying computations are more complex.

# geom_density() has no binwidth parameter, so none is passed here
ggplot(mpg, aes(hwy)) + 
  geom_density()

ggplot(mpg, aes(displ, colour = drv)) + 
  geom_freqpoly(binwidth = 0.5)

ggplot(mpg, aes(displ, fill = drv)) + 
  geom_histogram(binwidth = 0.5) + 
  facet_wrap(~drv, ncol = 1)

2.6.4 Bar charts

Unsummarised data:

ggplot(mpg, aes(manufacturer)) + 
  geom_bar() 

Presummarised data:

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
  )

ggplot(drugs, aes(drug, effect)) + 
  geom_bar(stat = "identity")

ggplot(drugs, aes(drug, effect)) + geom_point()

2.6.5 Time series with line and path plots

Line plots usually have time on the x-axis, showing how a single variable has changed over time.

ggplot(economics, aes(date, unemploy / pop)) + 
  geom_line()

ggplot(economics, aes(date, uempmed)) + 
  geom_line()

Path plots show how two variables have simultaneously changed over time, with time encoded in the way that observations are connected.

ggplot(economics, aes(unemploy / pop, uempmed)) + 
  geom_path() + 
  geom_point()

Because of the many line crossings, the direction in which time flows isn’t easy to see in the first plot.

year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) + 
  geom_path(colour = "grey50") + 
  geom_point(aes(colour = year(date)))

2.7 Modifying the axes

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1 / 3)

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1 / 3) + 
  xlab("city driving (mpg)") + 
  ylab("highway driving (mpg)")

# Remove the axis labels with NULL
ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1 / 3) + 
  xlab(NULL) + 
  ylab(NULL)

xlim() and ylim() modify the limits of axes:

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25) 

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25) + 
  xlim("f", "r") + 
  ylim(20, 30)
## Warning: Removed 139 rows containing missing values (geom_point).

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25, na.rm = TRUE) + 
  ylim(NA, 30)

Values outside the limits are converted to NA, which is what triggers the warning above. You can suppress the warning with na.rm = TRUE.
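
To see where the removed rows come from, the following sketch compares the number of observations outside the limits with the number of NA values in the data ggplot2 prepares for drawing. It uses geom_point() rather than geom_jitter() so that no random noise is involved, and the limits of 20 and 30 are simply reused from the example above.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(na.rm = TRUE) +
  ylim(20, 30)

# Rows whose hwy value lies outside the limits
outside <- mpg$hwy < 20 | mpg$hwy > 30
sum(outside)

# In the data prepared for drawing, the same number of rows have y = NA
sum(is.na(layer_data(p)$y))
```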

2.8 Output

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) + 
  geom_point()

Render it on screen, with print().

print(p)

Save it to disk, with ggsave(). Passing the plot explicitly avoids relying on whichever plot was displayed last.

ggsave("plot.png", p, width = 5, height = 5)

Briefly describe its structure with summary().

summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
##   fl, class [234x11]
## mapping:  x = displ, y = hwy, colour = factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map: function
##     map_data: function
##     params: list
##     render_back: function
##     render_front: function
##     render_panels: function
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train: function
##     train_positions: function
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS().
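
A minimal sketch of that round trip is shown below. The file is written to a temporary location chosen by tempfile(), which is an assumption for the example; in practice you would pick your own path.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

# Write the complete plot object to disk
path <- tempfile(fileext = ".rds")
saveRDS(p, path)

# Later, or in another session: read it back as a ggplot object
q <- readRDS(path)

# print(q) would now draw the same plot as print(p)
class(q)
```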

2.9 Quick plots

qplot(displ, hwy, data = mpg)

qplot(displ, data = mpg)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Unless otherwise specified, qplot() tries to pick a sensible geometry and statistic based on the arguments provided.

If you want to set an aesthetic to a constant, you need to use I():

qplot(displ, hwy, data = mpg, colour = "blue")

qplot(displ, hwy, data = mpg, colour = I("blue"))

Chapter 3 Toolbox

3.1 Introduction

3.2 Basic plot types

Each of these geoms is two dimensional and requires both x and y aesthetics.
All of them understand colour (or color) and size aesthetics, and the filled geoms (bar, tile and polygon) also understand fill.

df <- data.frame(
  x = c(3, 1, 5),
  y = c(2, 4, 6),
  label = c("a","b","c")
  )
p <- ggplot(df, aes(x, y, label = label)) + 
  labs(x = NULL, y = NULL) + # Hide axis label 
  theme(plot.title = element_text(size = 12)) # Shrink plot title
  • geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other.
p + geom_area() + ggtitle("area")

  • geom_bar(stat = "identity") makes a bar chart. We need stat = "identity" because the default stat automatically counts values (so it is essentially a 1d geom). The identity stat leaves the data unchanged. Multiple bars in the same location will be stacked on top of one another.
p + geom_bar(stat = "identity") + ggtitle("bar")

  • geom_line() makes a line plot. The group aesthetic determines which observations are connected.
p + geom_line() + ggtitle("line")

  • geom_path() is similar to a geom_line(), but lines are connected in the order they appear in the data, not from left to right.
p + geom_path() + ggtitle("path")

  • geom_point() produces a scatterplot. geom_point() also understands the shape aesthetic.
p + geom_point() + ggtitle("point")

  • geom_polygon() draws polygons, which are filled paths.
p + geom_polygon() + ggtitle("polygon")

  • geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height.
    geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
p + geom_tile() + ggtitle("tile")
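
To make the two parameterisations concrete, the sketch below describes one made-up rectangle both ways and checks that ggplot2 resolves them to the same extents. The data frames corners and centre are hypothetical.

```r
library(ggplot2)

# The same rectangle, described by its corners and by its centre and size
corners <- data.frame(xmin = 1, xmax = 3, ymin = 2, ymax = 6)
centre <- data.frame(x = 2, y = 4, width = 2, height = 4)

by_corners <- ggplot(corners) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax))

by_centre <- ggplot(centre, aes(x, y)) +
  geom_tile(aes(width = width, height = height))

# Extents computed for each layer
rect_extent <- layer_data(by_corners)[c("xmin", "xmax", "ymin", "ymax")]
tile_extent <- layer_data(by_centre)[c("xmin", "xmax", "ymin", "ymax")]
rect_extent
tile_extent
```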

3.3 Labels

geom_text() has the most aesthetics of any geom, because there are so many ways to control the appearance of text:
  • family gives the name of a font.

df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) + 
  geom_text(aes(label = family, family = family))

  • fontface specifies the face: “plain” (the default), “bold” or “italic”.
df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) + 
  geom_text(aes(label = face, fontface = face))

You can adjust the alignment of the text with the hjust (“left”, “center”, “right”, “inward”, “outward”) and vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics.

df <- data.frame( 
  x = c(1, 1, 2, 2, 1.5), 
  y = c(1, 2, 1, 2, 1.5), 
  text = c( 
    "bottom-left", "top-left", 
    "bottom-right", "top-right", "center"
    )
  )
ggplot(df, aes(x, y)) + 
  geom_text(aes(label = text)) 

ggplot(df, aes(x, y)) + 
  geom_text(aes(label = text), vjust = "inward", hjust = "inward")

  • size controls the font size.
  • angle specifies the rotation of the text in degrees.

  • The nudge_x and nudge_y parameters allow you to nudge the text a little horizontally or vertically:

df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) + 
  geom_point() + 
  geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) + 
  xlim(1, 3.6)

  • If check_overlap = TRUE, overlapping labels will be automatically removed.
ggplot(mpg, aes(displ, hwy)) + 
  geom_text(aes(label = model)) + 
  xlim(1, 8)

ggplot(mpg, aes(displ, hwy)) +
  geom_text(aes(label = model), check_overlap = TRUE) +
  xlim(1, 8)

A variation on geom_text() is geom_label(): it draws a rounded rectangle behind the text. This makes it useful for adding labels to plots with busy backgrounds:

label <- data.frame(
  waiting = c(55, 80),
  eruptions = c(2, 4.3),
  label = c("peak one", "peak two")
  )
ggplot(faithfuld, aes(waiting, eruptions)) +
  geom_tile(aes(fill = density)) +
  geom_label(data = label, aes(label = label))

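Text labels can also serve as an alternative to a legend. Compare the default legend with labels placed directly on the plot by the directlabels package:

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()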

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point(show.legend = FALSE) +
  directlabels::geom_dl(aes(label = class), method =
                          "smart.grid")

3.4 Annotations

  • geom_text() to add text descriptions or to label points.

  • geom_rect() to highlight interesting rectangular regions of the plot.

  • geom_line(), geom_path() and geom_segment() to add lines.

  • geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules) that span the full range of the plot.

ggplot(economics, aes(date, unemploy)) + 
  geom_line()

We can annotate this plot with which president was in power at the time:

presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) + 
  geom_rect(
    aes(xmin = start, xmax = end, fill = party),
    ymin = -Inf, ymax = Inf, alpha = 0.2,
    data = presidential
    ) +
  geom_vline(
    aes(xintercept = as.numeric(start)),
    data = presidential,
    colour = "grey50", alpha = 0.5
    ) +
  geom_text(
    aes(x = start, y = 2500, label = name),
    data = presidential,
    size = 3, vjust = 0, hjust = 0, nudge_x = 50
    ) +
  geom_line(aes(date, unemploy)) +
  scale_fill_manual(values = c("blue", "red"))

yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have
                         varied a lot over the years", 40),
                 collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  geom_text(
    aes(x, y, label = caption),
    data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
    hjust = 0, vjust = 1, size = 4
    )

It’s easier to use the annotate() helper function which creates the data frame for you:

ggplot(economics, aes(date, unemploy)) + 
  geom_line() + 
  annotate(
    "text", x = xrng[1], y = yrng[2], label = caption,
    hjust = 0, vjust = 1, size = 4
    )

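Annotations, particularly reference lines, are also useful when comparing groups across facets. In the following plots, it’s much easier to see the subtle differences if we add a reference line.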

ggplot(diamonds, aes(log10(carat), log10(price))) + 
  geom_bin2d() + 
  facet_wrap(~cut, nrow = 1)

mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) + 
  geom_bin2d() + 
  geom_abline(intercept = mod_coef[1], slope = mod_coef[2], 
              colour = "white", size = 1) + 
  facet_wrap(~cut, nrow = 1)

3.5 Collective geoms

An individual geom draws a distinct graphical object for each observation (row). For example, the point geom draws one point per row. A collective geom displays multiple observations with one geometric object.

By default, the group aesthetic is mapped to the interaction of all discrete variables in the plot.
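
The default grouping can be inspected with layer_data(), which returns the data computed for a layer, including its group column. The sketch below is only an illustration using mpg; the object names are made up.

```r
library(ggplot2)

# No discrete variable in the plot: all rows fall into a single group
one_group <- ggplot(mpg, aes(displ, hwy)) +
  geom_line()
length(unique(layer_data(one_group)$group))

# drv is discrete (three drive trains), so the default grouping splits on it
by_drv <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_line()
length(unique(layer_data(by_drv)$group))
```

When the default does not give the grouping you need, or when there is no discrete variable in the plot, you set the group aesthetic yourself.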

data(Oxboys, package = "nlme")
head(Oxboys)
## Grouped Data: height ~ age | Subject
##   Subject     age height Occasion
## 1       1 -1.0000  140.5        1
## 2       1 -0.7479  143.4        2
## 3       1 -0.4630  144.8        3
## 4       1 -0.1643  147.1        4
## 5       1 -0.0027  147.7        5
## 6       1  0.2466  150.2        6

3.5.1 Multiple groups, one aesthetic

In many situations you want to separate the data into groups but render them in the same way: you want to be able to distinguish individual subjects, but not identify them.

ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() + 
  geom_line()

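Incorrect: without the group aesthetic, all observations are joined into a single line, which produces a sawtooth pattern.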

ggplot(Oxboys, aes(age, height)) + 
  geom_point() + 
  geom_line()

3.5.2 Different groups on different layers

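Sometimes we want to plot summaries that use a different level of aggregation, for example a single smooth line showing the overall trend for all boys. If the same grouping is used in both layers, we get one smooth per boy instead: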

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE)

Instead of setting the grouping aesthetic in ggplot(), where it will apply to all layers, we set it in geom_line() so it applies only to the lines.

ggplot(Oxboys, aes(age, height)) + 
  geom_line(aes(group = Subject)) + 
  geom_smooth(method = "lm", size = 2, se = FALSE)

3.5.3 Overriding the default grouping

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot()

Now we want to overlay lines that connect each individual boy. Simply adding geom_line() does not work: the lines are drawn within each occasion, not across each subject.

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() + 
  geom_line(colour = "#3366FF", alpha = 0.5)

We need to override the grouping to say we want one line per boy:

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() + 
  geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)

3.5.4 Matching aesthetics to graphic objects

Lines and paths work on an off-by-one principle: there is one more observation than line segment, so the aesthetic for the first observation is used for the first segment, the second observation for the second segment, and so on. This means that the aesthetic for the last observation is not used:

df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
  geom_line(aes(group = 1), size = 2) + 
  geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) +
  geom_line(aes(group = 1), size = 2) +
  geom_point(size = 5)

If you want the colour to change smoothly along the line, you can perform the linear interpolation yourself:

xgrid <- with(df, seq(min(x), max(x), length.out = 50))
interp <- data.frame( 
  x = xgrid,
  y = approx(df$x, df$y, xout = xgrid)$y,
  colour = approx(df$x, df$colour, xout = xgrid)$y
  )
ggplot(interp, aes(x, y, colour = colour)) +
  geom_line(size = 2) +
  geom_point(data = df, size = 5)

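For other collective geoms, such as polygons, the aesthetics of the individual components are used only if they are all the same; otherwise the default value is used. It is clear why fill works this way: how would you colour a polygon that had a different fill colour for each point on its border? Bar charts are a common case where this matters: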

ggplot(mpg, aes(class)) + 
  geom_bar()

ggplot(mpg, aes(class, fill = drv)) + 
  geom_bar()

If you try to map fill to a continuous variable in the same way, it doesn’t work.
To show multiple colours, we need multiple bars for each class, which we can get by overriding the grouping:

ggplot(mpg, aes(class, fill = hwy)) + 
  geom_bar()

ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
  geom_bar()

3.6 Surface plots

ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_contour(aes(z = density, colour = ..level..))

ggplot(faithfuld, aes(eruptions, waiting))+
  geom_raster(aes(fill = density))

Bubble plots work better with fewer observations:

small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
  geom_point(aes(size = density), alpha = 1/3) +
  scale_size_area()

3.7 Drawing maps

3.7.1 Vector boundaries

Vector boundaries are defined by a data frame with one row for each “corner” of a geographical region like a country, state, or county. It requires four variables:
* lat and long, giving the location of a point.
* group, a unique identifier for each contiguous region.
* id, the name of the region.
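
To show the expected shape of such a data frame, the sketch below builds a tiny hand-made one and draws it. The two square "regions" and their coordinates are invented for illustration and are not real boundaries.

```r
library(ggplot2)

# One row per corner; group separates the two contiguous regions
regions <- data.frame(
  long = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  id = rep(c("west", "east"), each = 4)
)

# group tells geom_polygon() which corners belong to the same outline
p <- ggplot(regions, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = NA, colour = "grey50") +
  coord_quickmap()

# Number of separate polygons ggplot2 will draw
length(unique(layer_data(p)$group))
```

Without the group aesthetic, all eight corners would be joined into a single polygon.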

#mi_counties <- ggplot2::map_data("county", "michigan") %>% 
#  select(lon = long, lat, group, id = subregion)
#head(mi_counties)
#ggplot(mi_counties, aes(lon, lat)) +
#  geom_polygon(aes(group = group)) +
#  coord_quickmap()
#ggplot(mi_counties, aes(lon, lat)) +
#  geom_polygon(aes(group = group), fill = NA, colour = "grey50") +
#  coord_quickmap()

3.7.2 Point metadata

#mi_cities <- maps::us.cities %>%
#  tbl_df() %>%
#  filter(country.etc == "MI") %>%
#  select(-country.etc, lon = long) %>%
#  arrange(desc(pop))
#mi_cities

It’s not terribly useful without a reference. You almost always combine point metadata with another layer to make it interpretable.

#ggplot(mi_cities, aes(lon, lat)) +
#  geom_point(aes(size = pop)) +
#  scale_size_area() +
#  coord_quickmap()
#ggplot(mi_cities, aes(lon, lat)) +
#  geom_polygon(aes(group = group), mi_counties, fill = NA, colour = "grey50") +
#  geom_point(aes(size = pop), colour = "red") +
#  scale_size_area() +
#  coord_quickmap()

3.7.3 Raster images

3.8 Revealing uncertainty

  • Discrete x, range: geom_errorbar(), geom_linerange()
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_errorbar()

base + geom_linerange()

  • Discrete x, range & center: geom_crossbar(), geom_pointrange()

base + geom_crossbar()

base + geom_pointrange()

  • Continuous x, range: geom_ribbon()
base + geom_ribbon()

  • Continuous x, range & center: geom_smooth(stat = "identity")
base + geom_smooth(stat = "identity")

3.9 Weighted data

There are two aesthetic attributes that can be used to adjust for weights. Firstly, for simple geoms like lines and points, use the size aesthetic:

Unweighted

ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point()

Weight by population

ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point(aes(size = poptotal / 1e6)) +
  scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))

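Secondly, for more complicated geoms that involve a statistical transformation, use the weight aesthetic. These weights will be passed on to the statistical summary function.

Unweighted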

ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point() +
  geom_smooth(method = lm, size = 1)

Weighted by population

ggplot(midwest, aes(percwhite, percbelowpoverty)) +
  geom_point(aes(size = poptotal / 1e6)) +
  geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
  scale_size_area(guide = "none")

The following code shows the difference this makes for a histogram of the percentage below the poverty line:

ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) +
  ylab("Counties")

# Divide by 1000 so the bar heights match the axis label
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal / 1e3), binwidth = 1) +
  ylab("Population (1000s)")
## Warning: Ignoring unknown aesthetics: weight
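
One way to confirm what the weight aesthetic does to a histogram is to add up the bar heights. In the sketch below each county is weighted by its population in thousands (the division by 1e3 is a choice made for this example), so the bars should sum to the total population in thousands.

```r
library(ggplot2)

# Weight each county by its population in thousands
p <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal / 1e3), binwidth = 1)

# With weights, each bar is the summed weight of the counties in that bin
bar_total <- sum(layer_data(p)$count)
pop_total <- sum(midwest$poptotal) / 1e3

bar_total
pop_total
```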

3.10 Diamonds data

diamonds
## # A tibble: 53,940 x 10
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows

3.11 Displaying distributions

For 1d continuous distributions the most important geom is the histogram, geom_histogram():

ggplot(diamonds, aes(depth)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.1) +
  xlim(55, 70)
## Warning: Removed 45 rows containing non-finite values (stat_bin).

If you want to compare the distribution between groups, you have a few options:
* Show small multiples of the histogram, facet_wrap(~ var).

  • Use colour and a frequency polygon, geom_freqpoly().
ggplot(diamonds, aes(depth)) +
  geom_freqpoly(aes(colour = cut), binwidth = 0.1) +
  xlim(58, 68) +
  theme(legend.position = "none")
## Warning: Removed 669 rows containing non-finite values (stat_bin).
## Warning: Removed 10 rows containing missing values (geom_path).

  • Use a “conditional density plot”, geom_histogram(position = "fill").
ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill") +
  xlim(58, 68)
## Warning: Removed 669 rows containing non-finite values (stat_bin).

An alternative to a bin-based visualisation is a density estimate.
geom_density() places a little normal distribution at each data point and sums up all the curves.
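
The idea can be sketched in a few lines of base R. This is a hand-rolled illustration of the principle, not ggplot2's implementation (which relies on stats::density()); the sample x and the bandwidth bw are made-up values.

```r
# A tiny made-up sample and an arbitrary bandwidth
x <- c(1, 2, 4)
bw <- 0.5

# At each grid point, average the normal densities centred on every
# observation (the sum of the curves, scaled by 1/n)
grid <- seq(-1, 6, length.out = 200)
kde <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = bw)))

# The result is itself a density: the area under the curve is close to 1
area <- sum(kde) * (grid[2] - grid[1])
area
```

A larger bandwidth widens each little normal curve and gives a smoother estimate; a smaller one gives a bumpier estimate that follows the individual points more closely.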

ggplot(diamonds, aes(depth)) + 
  geom_density(na.rm = TRUE) +
  xlim(58, 68) +
  theme()

ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
  geom_density(alpha = 0.2, na.rm = TRUE) +
  xlim(58, 68) +
  theme()

Sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice quality for quantity. Here are three options:

  • geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”.

ggplot(diamonds, aes(clarity, depth)) + 
  geom_boxplot()

ggplot(diamonds, aes(carat, depth)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1))) +
  xlim(NA, 2.05)
## Warning: Removed 997 rows containing non-finite values (stat_boxplot).

  • geom_violin(): the violin plot is a compact version of the density plot.
ggplot(diamonds, aes(clarity, depth)) + 
  geom_violin()

ggplot(diamonds, aes(carat, depth)) +
  geom_violin(aes(group = cut_width(carat, 0.1))) +
  xlim(NA, 2.05)
## Warning: Removed 997 rows containing non-finite values (stat_ydensity).

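  • geom_dotplot(): draws one point for each observation, carefully adjusted in space to avoid overlaps and show the distribution. It is useful for smaller datasets.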

3.12 Dealing with overplotting

  • Very small amounts of overplotting can sometimes be alleviated by making the points smaller, or using hollow glyphs.
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()

norm + geom_point(shape = 1) # Hollow circles

norm + geom_point(shape = ".") # Pixel sized

  • For larger datasets with more overplotting, you can use alpha blending (transparency) to make the points transparent.
norm + geom_point(alpha = 1 / 3)

norm + geom_point(alpha = 1 / 5)

norm + geom_point(alpha = 1 / 10)

  • If there is some discreteness in the data, you can randomly jitter the points to alleviate some overlaps with geom_jitter().

  • Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram) with geom_bin2d(). geom_hex() does the same with hexagonal bins and requires the hexbin package.

norm + geom_bin2d()

norm + geom_bin2d(bins = 10)

norm + geom_hex()

norm + geom_hex(bins = 10)

3.13 Statistical summaries

geom_bar() counts the number of diamonds in each bin, here one bin per colour:

ggplot(diamonds, aes(color)) + 
  geom_bar()

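To display a summary other than the count, such as the mean price for each colour, use the summary_bin stat:

ggplot(diamonds, aes(color, price)) +
  geom_bar(stat = "summary_bin", fun.y = mean)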


ggplot(diamonds, aes(table, depth)) + 
  geom_bin2d(binwidth = 1) +
  xlim(50, 70) +
  ylim(50, 70)
## Warning: Removed 36 rows containing non-finite values (stat_bin2d).

ggplot(diamonds, aes(table, depth, z = price)) +
  geom_raster(binwidth = 1, stat = "summary_2d", fun = mean) +
  xlim(50, 70) +
  ylim(50, 70)
## Warning: Removed 36 rows containing non-finite values (stat_summary2d).